Online Continual Learning of End-to-End Speech Recognition Models
Continual Learning, also known as Lifelong Learning, aims to continually
learn from new data as it becomes available. While prior research on continual
learning in automatic speech recognition has focused on the adaptation of
models across multiple different speech recognition tasks, in this paper we
propose an experimental setting for online continual learning for
automatic speech recognition of a single task. Specifically focusing on the
case where additional training data for the same task becomes available
incrementally over time, we demonstrate the effectiveness of performing
incremental model updates to end-to-end speech recognition models with an
online Gradient Episodic Memory (GEM) method. Moreover, we show that with
online continual learning and a selective sampling strategy, we can maintain
accuracy similar to that of retraining a model from scratch while requiring
significantly lower computation costs. We have also verified our method with
self-supervised learning (SSL) features.
Comment: Accepted at InterSpeech 202
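As a rough illustration of the kind of update a GEM-style method performs, the sketch below shows an A-GEM-like projected gradient step with a single averaged memory constraint; the flattened gradients, the projection rule, and the surrounding training loop are assumptions for illustration, not the paper's exact online GEM procedure.

```python
# Minimal sketch of a GEM-style projected gradient step (A-GEM-like single
# constraint); an illustrative simplification, not the paper's exact method.
import torch

def gem_project(g_new: torch.Tensor, g_mem: torch.Tensor) -> torch.Tensor:
    """Project the new-data gradient (1-D, flattened) so it does not increase
    the loss on episodic-memory (previously seen) data."""
    dot = torch.dot(g_new, g_mem)
    if dot >= 0:
        return g_new  # no conflict with past data: keep the gradient as-is
    # remove the component that conflicts with the memory gradient
    return g_new - (dot / torch.dot(g_mem, g_mem).clamp_min(1e-12)) * g_mem

# In practice, per-parameter gradients are flattened, projected, and copied
# back into the model before calling optimizer.step().
```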
SpatialCodec: Neural Spatial Speech Coding
In this work, we address the challenge of encoding speech captured by a
microphone array using deep learning techniques with the aim of preserving and
accurately reconstructing crucial spatial cues embedded in multi-channel
recordings. We propose a neural spatial audio coding framework that achieves a
high compression ratio, leveraging a single-channel neural sub-band codec and
SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec
is designed to encode the reference channel at low bit rates, and (ii) a
SpatialCodec captures relative spatial information for accurate multi-channel
reconstruction at the decoder end. In addition, we propose novel
evaluation metrics to assess the spatial cue preservation: (i) spatial
similarity, which calculates cosine similarity on a spatially intuitive
beamspace, and (ii) beamformed audio quality. Our system shows superior
spatial performance compared with high-bitrate baselines and a black-box neural
architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo.
Code and models are available at https://github.com/XZWY/SpatialCodec.
Comment: Paper in Submission
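A rough sketch of how a beamspace cosine-similarity metric of this kind could be computed is given below; the fixed steering weights, the magnitude-domain comparison, and the normalization are assumptions for illustration, not the paper's implementation.

```python
# Illustrative beamspace spatial-similarity metric: project reference and
# reconstructed multi-channel STFTs onto fixed beams, then compare the
# resulting beamspace representations with cosine similarity.
import numpy as np

def beamspace(spec: np.ndarray, steering: np.ndarray) -> np.ndarray:
    """spec: (mics, freq, frames) complex STFT; steering: (beams, mics, freq)."""
    # weighted sum over microphones per look direction -> (beams, freq, frames)
    return np.einsum('bmf,mft->bft', steering.conj(), spec)

def spatial_similarity(ref: np.ndarray, est: np.ndarray, steering: np.ndarray) -> float:
    r = np.abs(beamspace(ref, steering)).ravel()
    e = np.abs(beamspace(est, steering)).ravel()
    return float(np.dot(r, e) / (np.linalg.norm(r) * np.linalg.norm(e) + 1e-12))
```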
TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement
Speech enhancement models have progressed greatly in recent years, but still
show limitations in the perceptual quality of their speech outputs. We propose an
objective for perceptual quality based on temporal acoustic parameters. These
are fundamental speech features that play an essential role in various
applications, including speaker recognition and paralinguistic analysis. We
provide a differentiable estimator for four categories of low-level acoustic
descriptors: frequency-related parameters, energy- or
amplitude-related parameters, spectral balance parameters, and temporal
features. Unlike prior work that looks at aggregated acoustic parameters or a
few categories of acoustic parameters, our temporal acoustic parameter (TAP)
loss enables auxiliary optimization and improvement of many fine-grained speech
characteristics in enhancement workflows. We show that adding TAPLoss as an
auxiliary objective in speech enhancement produces speech with improved
perceptual quality and intelligibility. We use data from the Deep Noise
Suppression 2020 Challenge to demonstrate that both time-domain models and
time-frequency domain models can benefit from our method.
Comment: Accepted at ICASSP 202
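The sketch below shows how an acoustic-parameter loss of this kind can be attached as an auxiliary objective; `tap_estimator`, `base_loss_fn`, and the weight `alpha` are hypothetical placeholders rather than the paper's released components.

```python
# Minimal sketch of adding a temporal-acoustic-parameter (TAP) style auxiliary
# loss to a speech enhancement objective. `tap_estimator` stands in for a
# frozen, differentiable estimator of frame-level acoustic parameters.
import torch
import torch.nn.functional as F

def enhancement_loss(enhanced, clean, tap_estimator, base_loss_fn, alpha=0.1):
    base = base_loss_fn(enhanced, clean)      # primary loss, e.g. spectral or SI-SDR
    tap_enh = tap_estimator(enhanced)         # (batch, frames, n_params)
    tap_cln = tap_estimator(clean)
    aux = F.l1_loss(tap_enh, tap_cln)         # match acoustic parameters over time
    return base + alpha * aux
```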
PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement
Despite rapid advancement in recent years, current speech enhancement models
often produce speech that differs in perceptual quality from real clean speech.
We propose a learning objective that formalizes differences in perceptual
quality, by using domain knowledge of acoustic-phonetics. We identify temporal
acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. --
that are non-differentiable, and we develop a neural network estimator that can
accurately predict their time-series values across an utterance. We also model
phoneme-specific weights for each feature, as the acoustic parameters are known
to show different behavior in different phonemes. We can add this criterion as
an auxiliary loss to any model that produces speech, to optimize speech outputs
to match the values of clean speech in these features. Experimentally we show
that it improves speech enhancement workflows in both the time domain and the
time-frequency domain, as measured by standard evaluation metrics. We also
provide an analysis of phoneme-dependent improvement on acoustic parameters,
demonstrating the additional interpretability that our method provides. This
analysis can suggest which features are currently the bottleneck for
improvement.
Comment: Accepted at ICASSP 202
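One possible shape for such a phoneme-weighted criterion is sketched below; the parameter estimator, the soft phoneme alignment, and the weight matrix are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative phoneme-aligned acoustic-parameter loss: per-frame parameter
# errors are weighted by phoneme-specific weights obtained from a soft
# frame-level alignment. All component names are placeholders.
import torch

def phoneme_weighted_param_loss(enhanced, clean, param_estimator,
                                phoneme_posteriors, W):
    """
    param_estimator: waveform -> (batch, frames, n_params) acoustic parameters
    phoneme_posteriors: (batch, frames, n_phonemes) soft frame-level alignment
    W: (n_phonemes, n_params) phoneme-specific weight per parameter
    """
    err = (param_estimator(enhanced) - param_estimator(clean)).abs()  # (B, T, P)
    weights = phoneme_posteriors @ W                                  # (B, T, P)
    return (weights * err).mean()
```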
Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions
Enhancing speech signal quality in adverse acoustic environments is a
persistent challenge in speech processing. Existing deep-learning-based
enhancement methods often struggle to effectively remove background noise and
reverberation in real-world scenarios, degrading the listening experience. To
address these challenges, we propose a novel approach that uses pre-trained
generative methods to resynthesize clean, anechoic speech from degraded inputs.
This study leverages pre-trained vocoder or codec models to synthesize
high-quality speech while enhancing robustness in challenging scenarios.
Generative methods effectively handle information loss in speech signals,
resulting in regenerated speech that has improved fidelity and reduced
artifacts. By harnessing the capabilities of pre-trained models, we achieve
faithful reproduction of the original speech in adverse conditions.
Experimental evaluations on both simulated datasets and realistic samples
demonstrate the effectiveness and robustness of our proposed methods.
In particular, by leveraging the codec, we achieve superior subjective scores
for both simulated and realistic recordings. The generated speech exhibits
enhanced audio quality and reduced background noise and reverberation. Our findings
highlight the potential of pre-trained generative techniques in speech
processing, particularly in scenarios where traditional methods falter. Demos
are available at https://whmrtm.github.io/SoundResynthesis.
Comment: Paper in submission
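At a high level, the pipeline can be pictured as "enhance, then let a pre-trained generative decoder regenerate the waveform"; the sketch below uses hypothetical `enhancer` and `codec` objects and is not the paper's code.

```python
# High-level sketch of enhance-then-resynthesize with a pre-trained generative
# decoder. `enhancer` and `codec` are hypothetical stand-ins for a front-end
# that predicts a cleaned intermediate representation and a neural codec/vocoder.
import torch

@torch.no_grad()
def resynthesize(noisy_wav, enhancer, codec):
    cleaned_repr = enhancer(noisy_wav)   # e.g., denoised features or codec tokens
    # the pre-trained decoder regenerates anechoic speech, relying on its
    # learned speech prior to fill in information lost to noise/reverberation
    return codec.decode(cleaned_repr)
```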
State and Market in Socialist Development: the case of Chinese industrial planning
SUMMARY: This article examines the capacities and limits of the socialist state as an instrument of industrialisation in China. Chinese experience suggests that state involvement at all stages of socialist industrialisation should become more selective in its scope and more flexible in its managerial forms. It highlights the importance of developing a lively industrial microeconomy and striking a balance between state agencies and industrial enterprises.
Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
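As a rough picture of sequence-level distillation in this setting, the sketch below trains the current model on sequences decoded by the previous-task model (Kim and Rush-style seq-KD); `teacher.generate`, `student.log_probs`, and the weight `beta` are hypothetical placeholders, not the paper's exact interface.

```python
# Rough sketch of sequence-level KD (seq-KD) for a seq2seq model in a
# class-incremental setting: the previous-task "teacher" decodes the audio and
# the "student" is trained on those hypotheses alongside the new-task loss.
import torch

def seq_kd_loss(student, teacher, audio, new_task_loss, beta=1.0):
    with torch.no_grad():
        pseudo_seq = teacher.generate(audio)      # teacher beam-search hypotheses
    # student cross-entropy on the teacher-generated sequences
    kd = -student.log_probs(audio, pseudo_seq).mean()
    return new_task_loss + beta * kd
```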
The ability to learn new concepts sequentially is a major weakness for modern neural networks, which hinders their use in non-stationary environments. Their propensity to fit the current data distribution to the detriment of the past acquired knowledge leads to the catastrophic forgetting issue. In this work we tackle the problem of Spoken Language Understanding applied to a continual learning setting. We first define a class-incremental scenario for the SLURP dataset. Then, we propose three knowledge distillation (KD) approaches to mitigate forgetting for a sequence-to-sequence transformer model: the first KD method is applied to the encoder output (audio-KD), and the other two work on the decoder output, either directly on the token-level (tok-KD) or on the sequence-level (seq-KD) distributions. We show that the seq-KD substantially improves all the performance metrics, and its combination with the audio-KD further decreases the average WER and enhances the entity prediction metric